# SUMMARY

This work aimed to transform raw data in high quality and well organized data for research studies addressing genetics and neurodevelopmental disorders. Information and relations between patients, cnvs, genes, GO terms, and diagnoses where passed through a very demanding quality check analysis before being inserted in the relational database in order to eliminate redundancies and enhance uniformity whenever possible. By using this data, researchers can start their work one step further by querying and identyfying data for analysis rather than spent time in tasks related to data cleaning and data pre-processing. 

# CONTENTS
1. Raw Data
2. Data Quality Check
3. 3th-party Data Integration
4. The Database
5. Dopaminergic Datasets

## 1. Raw Data

Domain public data downloaded from [Sfari Gene Copy Number Variant Module](https://gene-archive.sfari.org/about-copy-number-variants/) on June of 2017.

The Sfari Gene CNV Module is a data collection of all CNVs that have been associated with ASD. The file _individual_data.csv_ contain information of published reports of autism and identifyed CNVs. This studies report profiles on affected individuals from which Sfari Gene extract the following information:

- **CNV Locus** A field where is reported a fixed position on a chromosome where a particular genenetic variation was observed in an individual.
- **Patient Id**    A field where is assigned an individual unique anounimous identification consisting of the name of the cohort to which the individual belongs, followed by a tag that differentiates that individual from others in the same cohort.
- **Report Id** A field where is attributed a unique identification of a publish report from which the information was gathered.
- **Case/Control**  A field describing the role of a particular individual in the study.
- **Patient Age**   A field where is recorded the age of the individual at date of the study occurrence.
- **Patient Gender**    A field denoting the gender of an individual.
- **Primary Diagnosis** A field where is described the primary diagnosis of an individual
- **Clinical Profile**  A field containing a broad range of an individual clinical information. It can include the clinical history, dysmorphic features, comorbidities, height, weight, head circumference, and more.
- **Cognitive Profile** A field where is recorded the IQ scores of an individual.
- **CNV Start** A field where is reported the starting chromossomic coordinates of a particular genetic variation.
- **CNV End**   A field where is reported the ending chromossomic coordinates of a particular genetic variation.
- **CNV Size**    A field where is reported the size in base pairs of a particular genetic variation.
- **CNV Type**  A field where is indicated the type of the genetic alteration observed in a particular CNV.
- **Genome Build**  A field reporting the human genome reference from which the CNV was compared to.
- **Validation**    A field where is indicated if a CNV was validated by a different sequencing machine.
- **Validation Descritpion**    A field describing the validation applyed in the _Validation_ field.
- **Primary Disorder Inheritance**  
- **Inheritance**   A field where is reported if the CNV arised de novo or if was inherited through either the maternal or paternal chromosomes.
- **Family Profile** A field listening the individual family as either simplex or multiplex.
- **Disease Segregation**   A field indicating the closeness between a CNV a neurodevelopmental disorder.
- **Gene Content**  A field lintening the genes contained in the CNV.
- **Altered Gene Expression**   A field where is reported any alteration in the expression of a gene contained in the CNV.

## 2. Data Quality Check

In order to populate the database with high quality data a set of rules was applied to validate the information contained in each row of the raw data. For a particular row the data was only inserted in the database if it was in accordance with the following settings:

1. _Patient Id_ was present.
2. To be a valid Diagnosis, it requires that _Primary Diagnosis_ was present and matched a regular expression.
3. For a CNV to be valid, the presence of _CNV Locus_, _CNV Start_, _CNV End_ and _Genome Build_ equal to GRCH38/HG38 was required. In the case of a CNV _Genome Build_ being in a different reference a succefull [LiftOver](https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) operation to Hg38 reference was an additional requirement to validate.

Additionally, the following fields were tested against regular expressions to reduce redundancies present in the raw data and to enhance uniformity in the database:

- _Patient Gender_
- _Patient Age_
- _Family Profile_
- _CNV Type_
- _Inheritance_

## 3. 3th-party Data Integration

To improve the information related to CNVs inserted in the database, the genes contained within each alterations, as well as, the [Gene Ontology](http://geneontology.org/) terms associated to each gene were added via [BimoaRt](https://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) queries.

## 4. The Database

The described previous steps resulted in a SQL Database named *sfari_gene_cnv.sql* with the following tables:

                mysql> SHOW TABLES;
                +--------------------------+
                | Tables_in_sfari_gene_cnv |
                +--------------------------+
                | cnvs                     |
                | cnvs_genes               |
                | cnvs_patients            |
                | diagnoses                |
                | diagnoses_patients       |
                | genes                    |
                | genes_go_annotations     |
                | go_annotations           |
                | inheritances             |
                | patients                 |
                | schema_info              |
                | variation_types          |
                +--------------------------+

The single named tables, namely _patients_, _cnvs_, _diagnoses_, _genes_, and *go_annotations* tables, are the main tables of the database whereas the double named tables like *cnvs_patients*, *diagnoses_patients*, *cnvs_genes* and *genes_go_annotations* map **many-to-many** relationships between rows in the single named tables. The _inheritances_ and *variation_types* tables were used to further described a CNV of an individual in *cnvs_patients* table using a **one-to-many** relationship. The number of rows and a description of each sigle named table is presented below:

- The _patients_ table map 28717 unique patients.

                mysql> DESCRIBE patients;
                +----------------+--------------+------+-----+---------+----------------+
                | Field          | Type         | Null | Key | Default | Extra          |
                +----------------+--------------+------+-----+---------+----------------+
                | id             | int(11)      | NO   | PRI | NULL    | auto_increment |
                | source_id      | varchar(255) | YES  |     | NULL    |                |
                | age            | int(11)      | YES  |     | NULL    |                |
                | gender         | varchar(255) | YES  |     | NULL    |                |
                | family_profile | varchar(255) | YES  |     | NULL    |                |
                +----------------+--------------+------+-----+---------+----------------+

- The _cnvs_ table map 5646 unique cnvs:

                mysql> DESCRIBE cnvs;
                +--------------+--------------+------+-----+---------+----------------+
                | Field        | Type         | Null | Key | Default | Extra          |
                +--------------+--------------+------+-----+---------+----------------+
                | id           | int(11)      | NO   | PRI | NULL    | auto_increment |
                | chromosome   | varchar(255) | YES  |     | NULL    |                |
                | start        | int(11)      | YES  |     | NULL    |                |
                | stop         | int(11)      | YES  |     | NULL    |                |
                | locus        | varchar(255) | YES  |     | NULL    |                |
                | genome_build | varchar(255) | YES  |     | NULL    |                |
                +--------------+--------------+------+-----+---------+----------------+

- The _genes_ table map 58116 unique genes.

                mysql> DESCRIBE genes;
                +--------------+--------------+------+-----+---------+----------------+
                | Field        | Type         | Null | Key | Default | Extra          |
                +--------------+--------------+------+-----+---------+----------------+
                | id           | int(11)      | NO   | PRI | NULL    | auto_increment |
                | ensembl_id   | varchar(255) | YES  |     | NULL    |                |
                | name         | varchar(255) | YES  |     | NULL    |                |
                | chromosome   | varchar(255) | YES  |     | NULL    |                |
                | start        | int(11)      | YES  |     | NULL    |                |
                | stop         | int(11)      | YES  |     | NULL    |                |
                | band         | varchar(255) | YES  |     | NULL    |                |
                | genome_build | varchar(255) | YES  |     | NULL    |                |
                +--------------+--------------+------+-----+---------+----------------+

- The *go_annotations* table map 18314 unique GO terms.

                mysql> DESCRIBE go_annotations;
                +---------------------+--------------+------+-----+---------+----------------+
                | Field               | Type         | Null | Key | Default | Extra          |
                +---------------------+--------------+------+-----+---------+----------------+
                | id                  | int(11)      | NO   | PRI | NULL    | auto_increment |
                | go_id               | varchar(255) | YES  |     | NULL    |                |
                | domain              | varchar(255) | YES  |     | NULL    |                |
                | evidence_code       | varchar(255) | YES  |     | NULL    |                |
                | name                | varchar(255) | YES  |     | NULL    |                |
                | information_content | double       | YES  |     | 0       |                |
                +---------------------+--------------+------+-----+---------+----------------+


- The _diagnoses_ table map 36 unique diagnoses.

                mysql> DESCRIBE diagnoses;
                +-------+--------------+------+-----+---------+----------------+
                | Field | Type         | Null | Key | Default | Extra          |
                +-------+--------------+------+-----+---------+----------------+
                | id    | int(11)      | NO   | PRI | NULL    | auto_increment |
                | name  | varchar(255) | YES  |     | NULL    |                |
                +-------+--------------+------+-----+---------+----------------+


- The *cnvs_patients* table map 68811 realations between patients and cnvs, as well as, the type and the inheritance of a patient-cnv pair.

                mysql> DESCRIBE cnvs_patients;
                +-------------------+---------+------+-----+---------+-------+
                | Field             | Type    | Null | Key | Default | Extra |
                +-------------------+---------+------+-----+---------+-------+
                | cnv_id            | int(11) | NO   | PRI | NULL    |       |
                | patient_id        | int(11) | NO   | PRI | NULL    |       |
                | variation_type_id | int(11) | YES  | MUL | NULL    |       |
                | inheritance_id    | int(11) | YES  | MUL | NULL    |       |
                +-------------------+---------+------+-----+---------+-------+

## 5. Dopaminergic Datasets

The datasets used to study neurodopaminergic mechanisms and their associations in participants with Autism and Develpmanetal Delay diagnoses. Inside the **networks folder** are three files used to built networks and to study their properties:
- *gene_dosage_vectors_dopamine_network.graphml*: cointains the links between genes involved in dopaminergic aspects and the participants.
- *go_vectors_dopamine_network.graphml*: cointains the links between of GO terms associated to genes involved in dopaminergic aspects and the participants.
- *gene_dosage_go_vectors_dopamine_network.graphml*: cointains the links between genes involved in dopaminergic aspects, the associated GO terms and the participants.
 These can be easly imported with Python [NetworkX](https://networkx.org/) or [Gephi](https://gephi.org/) software. 

The three datasets used in the machine learning approach are inside the **machine learning folder** and can be imported and reused with Python [SciKit-Learn](https://sklearn.org/) toolbox:
- *gene_dosage_vectors_dopamine.csv*: cointains gene dosage features related to dopaminergic aspects used to predict the participants diagnostic.
- *go_vectors_dopamine.csv*: cointains GO features related to dopaminergic aspects used to predict the participants diagnostic.
- *gene_dosage_go_vectors_dopamine.csv*: combine gene dosage and GO features related to dopaminergic aspects used to predict the participants diagnostic.